Prediction of type 2 diabetes mellitus using hematological factors based on machine learning approaches: a cohort study analysis

Type 2 diabetes mellitus (T2DM) is a significant public health problem globally. The diagnosis and management of diabetes are critical for reducing its complications, including cardiovascular disease and cancer. This study was designed to assess the potential association between T2DM and routinely measured hematological parameters. The study sample comprised 9000 adults aged 35–65 years recruited as part of the Mashhad stroke and heart atherosclerotic disorder (MASHAD) cohort study. Machine learning techniques including logistic regression (LR), decision tree (DT) and bootstrap forest (BF) algorithms were applied to analyze the data. All data analyses were performed using SPSS version 22 and SAS JMP Pro version 13 at a significance level of 0.05. Based on the performance indices, the BF model gave high accuracy, precision, specificity, and AUC. Previous studies suggested a positive relationship between the triglyceride-glucose (TyG) index and T2DM, so we also considered the association of the TyG index with hematological factors. This association was aligned with the results for T2DM, except for MCHC. The most effective factors in the BF model were age and white blood cell count (WBC). The BF model performed better than the LR and DT models in predicting T2DM, and it identified age and WBC as the most informative predictors.


Methods
Participants. The participants were recruited from the baseline of the Mashhad Stroke and Heart Atherosclerotic Disorders (MASHAD) study, Mashhad, north-eastern Iran, following a similar research protocol 10. A total of 9704 individuals aged 35–65 years were enrolled from the baseline of this cohort and studied according to their T2DM status. T2DM was defined as a fasting blood glucose (FBG) ≥ 126 mg/dl or treatment with oral hypoglycemic medications or insulin. We also considered the triglyceride-glucose (TyG) index for the diagnosis of T2DM, defined as in 11, and categorized it using the median of our data, which was 8.831. The inclusion criteria were males and females between the ages of 35 and 65 years. The data in this investigation were unbalanced (diabetic vs. non-diabetic). One approach to this problem is the Synthetic Minority Oversampling Technique (SMOTE) 12, one of the most widely used oversampling methods, which creates synthetic minority-class samples. Therefore, the SMOTE algorithm was used to balance the classes. After cleaning the data for each of the measured variables and balancing the classes, the final analysis data set comprised 9000 observations (Fig. 1).
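The core idea of SMOTE described above, interpolating between a minority-class sample and one of its nearest minority-class neighbours, can be sketched as follows. This is a minimal illustration only: the function name `smote_sketch`, the neighbour count `k`, the random seed, and the toy (age, WBC) values are all assumptions and do not come from the study, which does not state its SMOTE settings.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class
    points and their k nearest minority-class neighbours (the SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the synthetic point at a random position on the segment
        # between the base point and the chosen neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: (age, WBC) for a small minority class.
minority = [(52.0, 7.1), (58.0, 6.8), (61.0, 7.9), (49.0, 6.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a segment between two observed minority samples, the new points stay inside the range of the observed minority data.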
At the beginning of this study, we measured the demographic characteristics (including gender and age) and [...] collected in 20 ml vacuum tubes and centrifuged for 30–45 min to separate the serum and plasma, and later sent to Bu Ali Research Institute, Mashhad, for laboratory examinations. Aliquots of serum were also kept frozen at −80 °C for future analysis. The details of laboratory measurements and cut-offs are explained in the baseline report of the MASHAD cohort study 10.

Statistical analysis and model building.
Quantitative variables were described as mean ± SD and qualitative variables as frequency (%). Chi-square and Fisher's exact tests were applied to measure the association between categorical variables, and the means of quantitative variables in the two groups were compared using the independent t-test. In addition, machine learning techniques, namely logistic regression (LR), decision tree (DT), and bootstrap forest (BF) algorithms, were used to analyze the data and to assess the association between T2DM and hematological factors. We considered two models for the prediction of T2DM. Model I investigated the association of T2DM with hematological factors, and Model II investigated the association of the TyG index with hematological factors. All analyses were performed using SPSS version 22 (Armonk, NY: IBM Corp.) and SAS JMP Pro (SAS Institute Inc., Cary, NC) at a significance level of 0.05.
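The binary outcome of Model II (high vs. low TyG index, split at the median) can be sketched as below. The sketch assumes the definition of the TyG index commonly used in the literature, ln(fasting triglycerides [mg/dl] × fasting glucose [mg/dl] / 2); the exact form used in reference 11 should be checked against that source. The function names, the choice to label values strictly above the median as "high", and the sample values are illustrative assumptions.

```python
import math
import statistics


def tyg_index(triglycerides_mg_dl, glucose_mg_dl):
    """TyG index under the commonly used definition (an assumption here)."""
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2)


def dichotomize_at_median(values):
    """Return (median, labels) with label 1 for values above the median."""
    cutoff = statistics.median(values)
    return cutoff, [1 if v > cutoff else 0 for v in values]


# Hypothetical (triglycerides, fasting glucose) pairs in mg/dl.
records = [(90, 85), (150, 100), (210, 130), (120, 92), (300, 160)]
tyg_values = [tyg_index(tg, fbg) for tg, fbg in records]
cutoff, high_tyg = dichotomize_at_median(tyg_values)
print(round(cutoff, 3), high_tyg)
```

With a median split, roughly half of the participants fall into each TyG category, so the Model II outcome is close to balanced by construction.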

Logistic regression (LR) modeling. Logistic regression is a popular model for evaluating the relationship between predictor variables (either categorical or continuous) and a binary outcome in medicine, public health, and other fields 13. Let $Y_i$ denote the response variable, which takes the value 1 or 0 depending on whether the response occurs or not. Let $X_i$ be the vector of covariates associated with the response variable and $\beta$ the corresponding vector of regression coefficients. The association between the covariates and the binary response variable can then be investigated as follows:
$$\ln\left(\frac{P(Y_i = 1 \mid X_i)}{1 - P(Y_i = 1 \mid X_i)}\right) = X_i^{\top}\beta$$
Decision tree (DT) modelling. Machine learning is a branch of artificial intelligence that emerged in the late twentieth century 14,15. In other words, machine learning is a process for extracting hidden knowledge from large data sets. One of the important problems for researchers in this process is data classification 16, for which different techniques exist 16. DT can be applied in various applications in the medical field [17][18][19]. Because it is easy to understand and yields simple, interpretable rules, it is widely applied and studied in these fields 16. A DT consists of nodes and branches. There are three types of nodes: (1) the root node, which represents the result of subdividing all records into two or more exclusive subsets; (2) internal nodes, which represent possible decision points in the tree structure and are connected to the root node from the top and to the leaf nodes from the bottom; and (3) leaf nodes, which show the final result of dividing the records into target groups. Branches emanate from the root node and the internal nodes and indicate the chance of placing records in the target groups 14,15. The DT algorithm uses the Gini impurity index to select the best splitting variable:
$$\mathrm{Gini}(D) = 1 - \sum_{i} P_i^{2}$$
where $P_i$ is the probability that a record in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$. Logistic regression, by comparison, is a statistical model for dichotomous targets that investigates the effect of explanatory variables on the target variable. LR also provides the probability of placing each record in the target groups 20,21. Its main advantage is that it quantifies the direct or inverse relationship between each explanatory variable and the target. It is also a flexible method 22.
Bootstrap forest (BF) modeling. The BF platform fits an ensemble model by averaging several decision trees, each of which is fit to a bootstrap sample of the training data. Each split in each tree considers a random subset of the predictors. In this way, many weak models are combined to produce a stronger model. The final prediction for an observation is the average of the predicted values for that observation over all the decision trees. In this study, the BF model was used to identify the factors most strongly associated with diabetes.
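The Gini criterion and the bootstrap-forest idea described above can be illustrated with a deliberately small sketch. It is not the JMP Pro implementation: each "tree" here is a single split (a stump), so the random predictor subset is drawn once per tree, and the function names, the number of trees, the seed, and the toy (age, WBC) data are all assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_stump(rows, labels, features):
    """Find the (feature, threshold) split with the lowest weighted Gini."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                # Each side predicts its proportion of positive (1) labels.
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    return best


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Fit one-split trees, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feats = rng.sample(all_features, n_features)
        stump = best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if stump is not None:
            forest.append(stump[1:])
    return forest


def predict_proba(forest, row):
    """Average the per-tree predicted probabilities of class 1."""
    votes = [(p_right if row[f] >= t else p_left) for f, t, p_left, p_right in forest]
    return sum(votes) / len(votes)


# Hypothetical toy data: (age, WBC) with 1 = diabetic, 0 = non-diabetic.
rows = [(38, 5.1), (41, 5.6), (44, 5.3), (45, 6.0),
        (52, 7.2), (57, 6.9), (60, 7.8), (63, 7.4)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
p_high = predict_proba(forest, (65, 8.0))
p_low = predict_proba(forest, (36, 5.0))
print(p_high, p_low)
```

Averaging over trees fit to different bootstrap samples and predictor subsets is what makes the ensemble more stable than any single tree; a full implementation would grow deeper trees and redraw the predictor subset at every split.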
Receiver operating characteristic (ROC) curves, together with accuracy, precision, and specificity, were used to evaluate all three algorithms. The confusion matrices of the three algorithms are also given.
Ethics approval. All participants consented to take part in the study by signing written informed consent. The study protocol was reviewed and all methods were approved by the Ethics Committee of Mashhad University of Medical Sciences with approval number IR.MUMS.REC.1399.660. All methods were carried out in accordance with relevant guidelines and regulations.

Results
A total of 9000 participants with complete data were analyzed in this cohort study (N = 4500 with diabetes [female 62.77% vs male 37.22%], N = 4500 without diabetes [female 59.15% vs male 40.84%]). The main baseline characteristics of the study population are summarized in Table 1. All variables differed significantly between the two groups, including age, WBC, PDW, RDW, RBC, sex, PLT, MCHC, and HCT (P < 0.05). Because previous studies have reported a positive relationship between the TyG index and the presence of T2DM, we also considered the association of the TyG index with the hematological factors 11,23,24.
Three machine learning techniques were used to investigate the relationship between the hematological predictors and the binary response variable (diabetic vs. non-diabetic). The main objective of this study was to predict diabetes using the LR, DT, and BF models and to determine the associated factors, especially hematological markers. For this purpose, the dataset was randomly split into training data (75%) and test data (25%). The training dataset was used to develop the DT and BF models, which were then validated on the test data that had not been used during training.
LR model. Results from the multiple LR model revealed that all variables were significantly associated with having diabetes (P < 0.05). After adjusting for the effect of the other variables in Model I, the odds of having diabetes in males were 0.69 times those in females (P < 0.05), and each unit increase in age raised the odds of having diabetes by 8 percent (P < 0.05). Among the analyzed variables, age (OR = 1.08, 95%CI = (1.07,1.08)), WBC (OR = 1.29, 95%CI = (1.24,1.33)), and PDW (OR = 1.11, 95%CI = (1.08,1.14)) had the greatest associations with having diabetes, especially WBC: for each unit increase in WBC, the odds of having diabetes increased by 29 percent (P < 0.001) (Table 2, Model I). Similarly, after adjusting for the effect of the other variables in Model II, the odds of having a high TyG index in males were 0.66 times those in females (P < 0.05), and each unit increase in age raised the odds of having a high TyG index by 7 percent (P < 0.05). Among the analyzed variables, age (OR = 1.07, 95%CI = (1.06, 1.08)), RBC (OR = 1.74, 95%CI = (1.36, 1.38)), WBC (OR = 1.33, 95%CI = (1.28,1.38)), and PDW (OR = 1.08, 95%CI = (1.05,1.12)) had the greatest associations with having a high TyG index, especially WBC: for each unit increase in WBC, the odds of having a high TyG index increased by 33 percent (P < 0.001) (Table 2, Model II).
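The way the odds ratios above translate into percentage changes in the odds, and into the underlying logistic coefficients, can be sketched as follows. The helper names are hypothetical, and the standard-error calculation assumes a symmetric Wald interval on the log-odds scale, which the text does not state.

```python
import math


def odds_ratio_for_increase(or_per_unit, units):
    """Odds ratio implied by a change of `units` in the predictor."""
    beta = math.log(or_per_unit)  # logistic coefficient per unit
    return math.exp(beta * units)


def percent_change_in_odds(odds_ratio):
    """Express an odds ratio as a percentage change in the odds."""
    return (odds_ratio - 1.0) * 100.0


def wald_standard_error(ci_low, ci_high, z=1.96):
    """Standard error of the coefficient, assuming a symmetric Wald interval
    on the log-odds scale (an assumption, not stated in the text)."""
    return (math.log(ci_high) - math.log(ci_low)) / (2 * z)


# Values reported for WBC in Model I: OR = 1.29, 95% CI (1.24, 1.33).
print(round(percent_change_in_odds(1.29), 1))
print(round(odds_ratio_for_increase(1.29, 2), 2))
print(round(wald_standard_error(1.24, 1.33), 4))
```

Because the model is linear on the log-odds scale, odds ratios multiply rather than add: a two-unit increase in WBC corresponds to the per-unit odds ratio squared, not to twice the percentage change.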
For comparison, the confusion matrices of models I and II are given in Table 4. Moreover, Fig. 2 (a) and (b) show the ROC curves of models I and II.
DT model. Figures S1 and S2 in the Supplementary Information file illustrate the outcomes of the DT training for hematological factors. The DT algorithm identified the various diabetes risk factors and arranged them in 5 layers. In the DT model, the first variable (root) is the most important for classifying the data, and the variables at subsequent levels are successively less important 25. Figures S1 and S2 show that WBC, followed by age and RDW, had the greatest impact on the risk of diabetes presence in models I and II.
In Model I, according to the DT model, participants with age < 47, WBC < 5.9, and RDW ≥ 41.2 had a lower rate of diabetes than those with higher WBC and RDW levels and older ages (0.8793 vs. 0.1207 incident rate). Eighty percent of patients had diabetes in the subgroup with older age (≥ 47), low RDW (41.7), and high WBC (≥ 6.8).
BF model. The confusion matrices of models I and II are given in Table 4. Moreover, Fig. 2 (e) and (f) show the ROC curves of models I and II. As shown in Table 4, the accuracy of models I and II was 83.33 and 97.43 percent, respectively. Furthermore, the important variables associated with T2DM based on the BF algorithm were, in order: age, WBC, PLT, RBC, RDW, PDW, HCT, MCHC, and sex in model I, and age, WBC, RBC, HGB, RDW, PDW, PLT, HCT, MCHC, and sex in model II. Age and WBC were the most significant factors, which is consistent with the results obtained from the LR and DT models. We summarize this study in a graphical abstract in Fig. 3.
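The performance indices used to compare the models (accuracy, precision, specificity, and AUC) can be computed from a confusion matrix and from predicted scores, as in the sketch below. The counts and scores are hypothetical and are not taken from Table 4; the function names are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common indices derived from the four cells of a confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(scores, labels):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confusion-matrix counts for a test set of 200 participants.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(metrics, auc(scores, labels))
```

Accuracy, precision, and specificity depend on the chosen classification threshold, whereas the AUC summarizes the ROC curve over all thresholds, which is why the two kinds of index are reported side by side.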

Discussion
In this study, a large number of biological and hematological factors, including age, WBC, PDW, RDW, RBC, sex, PLT, MCHC, and HCT, had a significant relationship with T2DM. As mentioned previously, we also considered the association of the TyG index with hematological factors because of its positive relationship with the presence of T2DM. The association of hematological factors with the TyG index was aligned with the results for T2DM, except for MCHC. Therefore, the discussion below is based on the results for T2DM and hematological factors. The most important factors associated with the presence of T2DM were age (the most important and significant factor, in the first line of the DT) and WBC (the second factor). We found that in people over the age of 47, the risk of diabetes increased dramatically. In line with our study, a study conducted in western Algeria on a sample of 1852 subjects obtained similar results with an age of 50 25. In another study, the researchers indirectly found that the prevalence of T2DM was higher in middle-aged patients than in younger patients 26. Contrary to our findings, a study of 307 diabetics showed that age had no significant relationship with the incidence or prevalence of diabetes 1.
Our findings show that WBC may be associated with the presence of T2DM. Among people with a WBC ≥ 5.4, the prevalence of diabetics was four times that of non-diabetics. Similarly, Lindsay et al. found that high WBC has the potential to predict T2DM after adjusting for age and sex 31. Another study, conducted in 2018, showed that a high WBC count, a marker of subclinical inflammation, can be used as an indicator of T2DM due to obesity 32.
One of the most important difficulties for diabetics is the increased risk of thrombotic events and coagulation problems 33. Platelets are the main cellular element of coagulation and play an important role in this process, and disruption in their number, shape, and activation pathways (measured by PT and MPV criteria) can lead to coagulation problems. The results of our study indicated a direct association between PLT count and the risk of diabetes. Contrary to our findings, a study of 1852 Algerian subjects, including 1059 type 2 diabetic patients, showed a negative effect of PLT on the onset of T2DM 25, and some studies showed that PLT levels are not involved in the development of diabetes pathology 34,35. The association between PLT and MPV and their effects on each other have been investigated and confirmed in other studies, but surprisingly we could not find any significant association between MPV and the incidence of diabetes. Similarly, a number of studies could not find any association [36][37][38][39], whereas others reported conflicting results showing positive effects [40][41][42]. [...] 43. Nevertheless, a study conducted in China in 2018 showed a direct link between RDW and the incidence of diabetes 44.
Figure 2. ROC curves for the LR, DT, and BF algorithms: panels (a), (c), and (e) for model I; panels (b), (d), and (f) for model II.
Table 3. Detailed rules based on DT in models I and II.
We also found that HCT was negatively associated with the presence of diabetes, and a 2020 study in Northwest Ethiopia confirmed this inverse relationship 45. However, another study found no significant link between HCT and diabetes 46.
We also found that, like high WBC, high RBC and high MCHC were associated with an increased risk of diabetes. As shown in the decision tree, an RBC below 4.73 was associated with a markedly lower risk of diabetes.
Similar to our results, a study of 87 bangal T2DM patients showed a correlation of high MCHC and RDW with T2DM 47. However, a study carried out in Saudi Arabia on a population with T2DM showed a negative association between diabetes and MCHC 48. This factor therefore needs further investigation to determine its exact link to diabetes. In our study, each unit increase in RBC multiplied the odds of having diabetes by 1.64, which indicates a strong effect of red blood cell count on the risk of T2DM. However, very few studies have linked this factor to T2DM, and most have only reported the effect of T2DM on changes in the appearance and properties of red blood cells [49][50][51]. Even a 2013 study by Zhan-Sheng Wang and Zhan-Chun Song, which examined the relationship between red blood cell count and microvascular complications in Chinese patients with T2DM, yielded conflicting results: the proportion of patients with microvascular complications increased with decreasing red blood cell count (p value below 0.001) 52. Another study, conducted in India in 2019, examined the role of hematological factors in diabetes mellitus and reported that poorly controlled diabetics were more likely to develop anemia 53.
One of the most important strengths of our study is its large sample size. A second strength is the wide age range, which includes young, middle-aged, and elderly participants and allows the relationship to be examined in each group. In addition, we examined a relatively large number of hematological factors, some of which have not yet been widely studied.
One of the limitations of this study is that we did not measure HbA1c in participants of the MASHAD cohort study. Moreover, the study population lacked cultural diversity: all participants were adults in the Mashhad cohort who live in a common geographical area with relatively similar customs and lifestyles. This makes it impossible to generalize the results of this study to other countries or even to the total population of Iran.
The results of this study can help health authorities in early diagnosis and prevention of diabetes by examining only a few simple hematological criteria.

Conclusion
Our study showed that the BF model performed better than the DT and LR models in predicting T2DM. According to our results, some hematological factors, such as WBC, PDW, RDW, RBC, PLT, MCHC, and HCT, could be valuable tools in the prediction of T2DM. Among these hematological factors, WBC had the most significant role in the prediction of T2DM. Our findings indicate that hematological factors can be of value in the health care setting for predicting T2DM, as they are cost-effective, accessible, and simple markers.

Data availability
Data sharing is not applicable to this article as no new data were created in this study. Further enquiries can be directed to the corresponding author.